This milestone report is part of the Capstone Course of the Data Science Specialization on Coursera. The goal of this project is to create a predictive text model using data from SwiftKey. The data was collected by scraping the web for blog posts, news articles and tweets.
The data comes from three separate files, one each for the blogs, news and Twitter sources. I start the report by analysing each dataset on its own. Later, I combine them and analyse them all together.
I will start by exploring which words are the most common in each of the datasets. Before doing so, I will remove common words known as stop words. These are words such as "the", "is" and "and".
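The counting step described above can be sketched in Python as follows. This is only an illustration, not the code behind the report's plots: the stop-word list is a tiny hand-picked set, the function name `top_words` is hypothetical, and the sample lines are made up to stand in for one of the three source files.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption); a real analysis would use a fuller lexicon.
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to", "in", "it", "on", "for"}

def top_words(lines, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Lowercase the line and keep runs of letters, allowing internal apostrophes.
        tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", line.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

# Made-up sample lines, not taken from the SwiftKey data.
sample = [
    "The police said the road is closed",
    "Police closed the road and the bridge",
    "lol the bridge is closed",
]
result = top_words(sample, n=3)
print(result)
```

Running the same function once per file, and then once on the three files concatenated, gives both the per-source and the combined view.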
We see a lot of similarities between the different sources but also some differences. The Twitter data contains more abbreviations such as lol, while the news data contains words like police and other official-sounding words. The blogs data seems to have the most ordinary vocabulary.
Word clouds provide an additional visualization of each dataset and can give us a more intuitive sense of its contents. I have added sentiment analysis to the plots, so we can also see which positive and negative words each dataset contains.
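One way to prepare the input for such a sentiment-coloured word cloud is to split the word counts by a sentiment lexicon. The sketch below assumes a tiny hand-made lexicon and a hypothetical helper `sentiment_counts`; the report does not state which lexicon or plotting tool was actually used, and the counts shown are invented.

```python
from collections import Counter

# Tiny hypothetical lexicon (an assumption); a real analysis would use a published one.
SENTIMENT = {
    "good": "positive", "love": "positive", "happy": "positive",
    "bad": "negative", "hate": "negative", "wrong": "negative",
}

def sentiment_counts(word_counts, lexicon=SENTIMENT):
    """Group word counts by sentiment label, dropping words missing from the lexicon."""
    grouped = {"positive": Counter(), "negative": Counter()}
    for word, count in word_counts.items():
        label = lexicon.get(word)
        if label is not None:
            grouped[label][word] += count
    return grouped

# Invented counts, for illustration only.
counts = Counter({"love": 5, "good": 3, "bad": 2, "police": 4})
grouped = sentiment_counts(counts)
print(grouped["positive"].most_common())
print(grouped["negative"].most_common())
```

Each group can then be drawn in its own colour, with word size proportional to its count. Words without a sentiment label, such as police here, are simply left out of the plot.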